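This paper studies real-world video super-resolution (VSR) for animation videos and reveals three key improvements for practical animation VSR. First, recent real-world super-resolution approaches typically rely on degradation simulation using basic operators without any learning capability, such as blur, noise, and compression. In this work, we propose to learn such basic operators from real low-quality animation videos and to incorporate the learned operators into the degradation generation pipeline. Such neural-network-based basic operators can help to better capture the distribution of real degradations. Second, a large-scale high-quality animation video dataset, AVC, is built to facilitate comprehensive training and evaluation for animation VSR. Third, we further investigate an efficient multi-scale network structure. It takes advantage of the efficiency of unidirectional recurrent networks and the effectiveness of sliding-window-based methods. Thanks to these delicate designs, our method, AnimeSR, is capable of restoring real-world low-quality animation videos effectively and efficiently, achieving superior performance to previous state-of-the-art methods.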
translated by Google Translate
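As a point of reference for the first abstract, the kind of hand-crafted degradation pipeline it contrasts with (fixed blur, noise, and compression operators with no learning capability) can be sketched roughly as follows. This is a minimal illustration, not the AnimeSR pipeline: the function names, the box blur, the operator order, the parameter values, and the use of gray-level quantization as a stand-in for compression are all assumptions made for the example.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale frame, pixel values in [0, 1]


def box_blur(img: Image, radius: int = 1) -> Image:
    """Blur with a mean filter, clipping the window at the image borders."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def downsample(img: Image, factor: int) -> Image:
    """Keep every `factor`-th pixel along both axes."""
    return [row[::factor] for row in img[::factor]]


def add_gaussian_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise and clamp the result back to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]


def quantize(img: Image, levels: int) -> Image:
    """Assumed stand-in for lossy compression: snap to `levels` gray levels."""
    return [[round(v * (levels - 1)) / (levels - 1) for v in row] for row in img]


def degrade(img: Image, rng: random.Random, factor: int = 2) -> Image:
    """Fixed-operator pipeline (assumed order): blur, downsample, noise, quantize."""
    blurred = box_blur(img)
    small = downsample(blurred, factor)
    noisy = add_gaussian_noise(small, sigma=0.02, rng=rng)
    return quantize(noisy, levels=32)


high_res = [[0.5] * 4 for _ in range(4)]
low_res = degrade(high_res, random.Random(0))
print(len(low_res), len(low_res[0]))  # 2 2
```

Every operator here is fixed by hand. The abstract's proposal is to replace such operators with ones learned from real low-quality animation videos, which this sketch does not attempt.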
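Interactive image restoration aims to restore images by adjusting several controlling coefficients, which determine the restoration strength. Existing methods are restricted to learning the controllable functions under the supervision of known degradation types and levels. They usually suffer from a severe performance drop when the real degradation differs from their assumptions. This limitation is due to the complexity of real-world degradations, which cannot provide explicit supervision for the interactive modulation during training. However, how to realize interactive modulation in real-world super-resolution has not yet been studied. In this work, we present Metric Learning based Interactive Modulation for Real-World Super-Resolution (MM-RealSR). Specifically, we propose an unsupervised degradation estimation strategy to estimate the degradation level in real-world scenarios. Instead of using known degradation levels as explicit supervision for the interactive mechanism, we propose a metric learning strategy that maps the unquantifiable degradation levels in real-world scenarios to a metric space, which is trained in an unsupervised manner. Moreover, we introduce an anchor point strategy in the metric learning process to normalize the distribution of the metric space. Extensive experiments demonstrate that the proposed MM-RealSR achieves excellent modulation and restoration performance in real-world super-resolution. Code is available at https://github.com/tencentarc/mm-realsr.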
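Colorization has attracted increasing interest in recent years. Classic reference-based methods usually rely on external color images to obtain plausible results. Retrieving such exemplars inevitably requires a large image database or an online search engine. Recent deep-learning-based methods can colorize images automatically at low cost. However, they are always accompanied by unsatisfactory artifacts and incoherent colors. In this work, we propose GCP-Colorization, which leverages the rich and diverse color priors encapsulated in a pretrained Generative Adversarial Network (GAN) for automatic colorization. Specifically, we first "retrieve" matched features (similar to exemplars) via a GAN encoder and then incorporate these features into the colorization process with feature modulations. Thanks to the powerful generative color prior (GCP) and delicate designs, GCP-Colorization can produce vivid colors with a single forward pass. Moreover, it is highly convenient to obtain diverse results by modifying GAN latent codes. GCP-Colorization also inherits the interpretable controls of GANs and can attain controllable and smooth transitions by walking through the GAN latent space. Extensive experiments and user studies demonstrate that GCP-Colorization achieves superior performance compared with previous works. Code is available at https://github.com/tothebeginning/gcp-colorization.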
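Recently, unmanned aerial vehicles (UAVs) have been widely deployed in various real-world scenarios such as disaster rescue and package delivery. Many of these working environments are unstructured, with uncertain and dynamic obstacles, and UAV collisions happen frequently. A UAV with high agility is highly desirable so that it can adjust its motions to adapt to these environmental dynamics. However, UAV agility is restricted by its battery power output; in particular, a UAV's power system is not aware of the actual power need in motion planning, while that need changes dynamically with the environment and the UAV's condition. It is therefore difficult to accurately align the power supply with the power need in motion planning. This mismatch leads to an insufficient power supply to the UAV and causes delayed motion adjustments, which largely increases the risk of collision with obstacles and thus undermines UAV agility. To improve UAV agility, a novel intelligent power solution, Agility-Enhanced Power Supply (AEPS), was developed to proactively prepare an appropriate amount of power to support motion planning with enhanced agility. The method builds a bridge between the physical power system and UAV planning. With agility-enhanced motion planning, the safety of UAVs in complex working environments will be improved. To evaluate the effectiveness of AEPS, "community security patrol" missions with unexpected obstacles were adopted; the power supply was realized by a hybrid integration of fuel cell, battery, and capacitor. The effectiveness of AEPS in improving UAV agility was validated by the successful and timely power supply, the improved task success rate and system safety, and the improved mission duration.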
A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks 1st and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. Code will be available at https://github.com/dvlab-research/Imbalanced-Learning.
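The simplex equiangular tight frame mentioned in the abstract above can be made concrete with a small numerical sketch. It uses one standard construction, sqrt(K/(K-1)) * (I - (1/K) * 1 1^T), whose K rows are unit vectors with pairwise cosine -1/(K-1). The function names are illustrative, and the sketch shows only the target structure, not the regularizer the paper proposes.

```python
import math
from typing import List


def simplex_etf(num_classes: int) -> List[List[float]]:
    """Return K unit vectors in R^K that form a simplex equiangular tight frame."""
    if num_classes < 2:
        raise ValueError("need at least two classes")
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    # Row i is the i-th standard basis vector, centered and rescaled to unit norm.
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
            for i in range(k)]


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


vectors = simplex_etf(4)
norms = [math.sqrt(dot(v, v)) for v in vectors]
cosines = [dot(vectors[i], vectors[j])
           for i in range(4) for j in range(i + 1, 4)]

# Every vector has unit length and every pair has the same cosine, -1/(K-1).
print(all(abs(n - 1.0) < 1e-9 for n in norms))           # True
print(all(abs(c + 1.0 / 3.0) < 1e-9 for c in cosines))   # True
```

The equal, maximally negative pairwise cosine is the "equiangular and maximally separated" property that, according to the abstract, contextual correlation and class imbalance break in semantic segmentation.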
Weakly-supervised object localization aims to indicate the category as well as the scope of an object in an image given only image-level labels. Most existing works are based on Class Activation Mapping (CAM) and endeavor to enlarge the discriminative area inside the activation map to perceive the whole object, yet they ignore the co-occurrence confounder of the object and its context (e.g., fish and water), which makes it hard for the model to distinguish object boundaries. Besides, the use of CAM also creates a dilemma: classification and localization always suffer from a performance gap and cannot reach their highest accuracy simultaneously. In this paper, we propose a causal knowledge distillation method, dubbed KD-CI-CAM, to address these two under-explored issues in one go. More specifically, we tackle the co-occurrence context confounder problem via causal intervention (CI), which explores the causalities among image features, contexts, and categories to eliminate the biased object-context entanglement in the class activation maps. Based on the de-biased object feature, we additionally propose a multi-teacher causal distillation framework to balance the absorption of classification knowledge and localization knowledge during model training. Extensive experiments on several benchmarks demonstrate the effectiveness of KD-CI-CAM in learning clear object boundaries from confounding contexts and addressing the dilemma between classification and localization performance.
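For readers unfamiliar with the Class Activation Mapping baseline that the abstract above builds on, a minimal sketch follows. It computes the standard CAM for one class as a weighted sum of the last convolutional feature maps, using that class's classifier weights, and then min-max normalizes it. The function names and the toy inputs are illustrative assumptions; the causal intervention and distillation components of KD-CI-CAM are not shown.

```python
from typing import List

FeatureMaps = List[List[List[float]]]  # indexed as [channel][row][col]


def class_activation_map(features: FeatureMaps,
                         class_weights: List[float]) -> List[List[float]]:
    """Weighted sum over channels: cam[y][x] = sum_k w[k] * features[k][y][x]."""
    if len(features) != len(class_weights):
        raise ValueError("one weight per feature channel is required")
    height, width = len(features[0]), len(features[0][0])
    return [[sum(w * fmap[y][x] for w, fmap in zip(class_weights, features))
             for x in range(width)]
            for y in range(height)]


def normalize(cam: List[List[float]]) -> List[List[float]]:
    """Min-max normalize to [0, 1]; a constant map becomes all zeros."""
    values = [v for row in cam for v in row]
    low, high = min(values), max(values)
    if high == low:
        return [[0.0 for _ in row] for row in cam]
    return [[(v - low) / (high - low) for v in row] for row in cam]


# Two 2x2 feature channels and the classifier weights of one class (toy values).
features = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
weights = [0.5, 1.0]

cam = class_activation_map(features, weights)
print(cam)             # [[0.5, 0.0], [0.0, 2.0]]
print(normalize(cam))  # [[0.25, 0.0], [0.0, 1.0]]
```

Thresholding the normalized map is the usual way to turn it into a localization mask. Because the map is driven by whatever correlates with the class label, co-occurring context such as water around a fish can also light up, which is the confounding effect the abstract targets.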
Witnessing the impressive achievements of pre-training techniques on large-scale data in computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit to mitigate the sample inefficiency problem in visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive amounts of information irrelevant to decision making, which makes predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for policy pre-training in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representations by predicting the future ego-motion and optimizing the photometric error based on the current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving-policy-related representations and is thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios demonstrate the superiority of our proposed approach, with improvements ranging from 2% to over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
Nearest-Neighbor (NN) classification has proven to be a simple and effective approach for few-shot learning. The query data can be classified efficiently by finding the nearest support class based on features extracted by pretrained deep models. However, NN-based methods are sensitive to the data distribution and may produce false predictions if the samples in the support set happen to lie around the distribution boundary of different classes. To solve this issue, we present P3DC-Shot, an improved nearest-neighbor-based few-shot classification method empowered by prior-driven data calibration. Inspired by the distribution calibration technique, which utilizes the distribution or statistics of the base classes to calibrate the data for few-shot tasks, we propose a novel discrete data calibration operation that is more suitable for NN-based few-shot classification. Specifically, we treat the prototypes representing each base class as priors and calibrate each support sample based on its similarity to different base prototypes. Then, we perform NN classification using these discretely calibrated support data. Results from extensive experiments on various datasets show that our efficient non-learning-based method can outperform, or is at least comparable to, SOTA methods that need additional learning steps.
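The two steps described in the abstract above, calibrating each support sample with base-class prototypes and then classifying queries by nearest neighbor, can be sketched as follows. The abstract does not give the calibration formula, so the one used here (a softmax over cosine similarities to the base prototypes, blended into the support feature with a weight `alpha`) is purely an assumption for illustration, as are the function names and parameter values. It should not be read as the P3DC-Shot operation itself.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def calibrate(support: Vector, base_prototypes: List[Vector],
              alpha: float = 0.5, temperature: float = 0.1) -> List[float]:
    """Assumed calibration: blend the support feature with a similarity-weighted
    average of the base-class prototypes (softmax over cosine similarities)."""
    sims = [cosine(support, p) for p in base_prototypes]
    peak = max(sims)  # subtract the max for a numerically stable softmax
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    prior = [sum(w / total * p[d] for w, p in zip(weights, base_prototypes))
             for d in range(len(support))]
    return [(1.0 - alpha) * s + alpha * q for s, q in zip(support, prior)]


def nearest_neighbor(query: Vector, supports: List[Tuple[Vector, str]]) -> str:
    """Return the label of the support feature most similar to the query."""
    return max(supports, key=lambda item: cosine(query, item[0]))[1]


# Toy 2-D features: two base prototypes and one labeled sample per novel class.
base_prototypes = [[1.0, 0.0], [0.0, 1.0]]
raw_supports = [([0.9, 0.2], "class_a"), ([0.1, 0.8], "class_b")]
calibrated = [(calibrate(feat, base_prototypes), label)
              for feat, label in raw_supports]

print(nearest_neighbor([1.0, 0.1], calibrated))  # class_a
```

No parameters are trained anywhere in this sketch, which matches the abstract's description of the method as non-learning based: the only inputs are pretrained features and base-class prototypes.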
In this tutorial paper, we look into the evolution and prospect of network architecture and propose a novel conceptual architecture for the 6th generation (6G) networks. The proposed architecture has two key elements, i.e., holistic network virtualization and pervasive artificial intelligence (AI). The holistic network virtualization consists of network slicing and digital twin, from the aspects of service provision and service demand, respectively, to incorporate service-centric and user-centric networking. The pervasive network intelligence integrates AI into future networks from the perspectives of networking for AI and AI for networking, respectively. Building on holistic network virtualization and pervasive network intelligence, the proposed architecture can facilitate three types of interplay, i.e., the interplay between digital twin and network slicing paradigms, between model-driven and data-driven methods for network management, and between virtualization and AI, to maximize the flexibility, scalability, adaptivity, and intelligence for 6G networks. We also identify challenges and open issues related to the proposed architecture. By providing our vision, we aim to inspire further discussions and developments on the potential architecture of 6G.